Course: Applied Geo-Data Science at the University of Bern (Institute of Geography)
Supervisor: Prof. Dr. Benjamin Stocker
Adviser: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider
Further information: https://geco-bern.github.io/agds/
Do you have questions about the workflow? Contact the Author:
Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)
Matriculation number: 20-100-178
Reference: Report Exercise 6 (Chapter 9)
This exercise is an introduction to supervised machine learning. With a realistic problem, the following goals should be trained:
Implementation of the KNN-algorithm
Discuss the bias-variance trade-off
The role of the hyper-parameter k
Supervised machine learning is a type of machine learning where the model is trained using labeled data to predict new and unseen data as accurately as possible (with the lowest error in the new data). But this approach has some challenges. For example, we need to know how good the quality of the predicted data is. Therefore, we use only a part of the labeled data for training. With the other part of the labeled data, we quantify the error. The KNN-algorithm (k nearest neighbors) will be trained to predict new data. The basis to calculate a model is distance between values of the variable. The two most important distance metrics for KNN are the euclidean distance and the Manhattan distance. With the hyper-parameter k we can fit our model. But what is the hyper-parameter k ?:
Euclidean distance:\[ \sqrt{\sum_{j=1}^P (X_{a,j} - X_{b,j})^2} \]Manhattan distance:\[ \sum_{j=1}^P |X_{a,j} - X_{b,j}| \]
Consider the following: With k = 1, we use only one neighbor and we can easily hit every target value in the training data exactly. The bias in the training data will be zero. Unfortunately, it would also include all errors. In this exercise, the target (GPP) and the predictors come from measurements and have always errors (always statistical and almost always systematic errors (measurement noise)). If we apply the model to predict new data, then we would project the error into the new data. That will cause a higher variance in the predicted data because our model is overfitted. The bias-variance trade-off in the train data shifts to the variance.
If we use a k = n, we will include many neighbors to calculate the target. The model hits almost none of the target in the training data. The prediction will not be satisfied. The RSME (root-square-mean-error) is high in the training data and will be high in the test data. We call that an underfitted model, and the bias-variance trade-off shifts to a very high bias.
The optimal k (prediction with the lowest bias) will be between this two extreme scenarios. But where is not so easy to answer. The goal is to find the model with the best generalisability, which is the model with the probably best bias-variance trade-off. This exercise shows you how we can find the best model with a KNN-algorithm.
We use the KNN-algorithm and compare the model to a multivariate linear regression model. We split our data and use 70% for training and 30% for model evaluation. Because the algorithm uses distance, we have to standardize our data. With the box-cox function, we transform our data into a normal distribution. We use set.seed for reproducibility.
The open-source program R-Studio (Version 2022.12.0+353) was used for all studies. The data handling in this program is done using the R language. We utilize the R-package “Tidyverse” and other packages to process the data effectively.
The following code chunk contains all packages we need. Important is the package “conflicted”. It enables us to choose if different functions have the same call, but do not make the same thing (a function in a certain package can have the same name as a function from another package). In this case, we will set preferences.
source("../../R/general/packages.R")
First, we get the data and save it in our repository. But we must do this only once. If the file exists, then we can read it directly. That is why we implemented an if-else statement.
name.of.file <- "../../data/re_ml_01/daily_fluxes.csv"
# If do not exists such a file, create it!
if (!file.exists(name.of.file)){
# Access to the data
url <- "https://raw.githubusercontent.com/geco-bern/agds/main/data
/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv"
# Read in the data directly from URL
daily_fluxes <- read.table(url, header = TRUE, sep = ",")
# Write a CSV file in the respiratory
write_csv(daily_fluxes, "../../data/re_ml_01/daily_fluxes.csv")
# Read the file
daily_fluxes.raw <- read_csv("../../data/re_ml_01/daily_fluxes.csv")
# If exists such a file, read it only!
}else{daily_fluxes.raw <- read_csv("../../data/re_ml_01/daily_fluxes.csv")}
If we take a closer look at the file, we see that there are 334 variables. We can not continue with our usual procedure because there are too many variables. That is why we have to clean our data first.
We select only the columns we need. Further, we can see that some columns contain -9999 as a value. Our use.good.quality.only-function changed that to NA. Then we use ymd() from the lubridate package to rewrite the date in a proper way. Further, we want only columns which contain good quality. For that, we check selected columns with their quality control column. If the proportion of good measurement is less than 80%, then we overwrite the value with NA (and do not drop the row!). After that, we discard all quality-control columns. Now we can see that the variable LW_IN_F (long-wave radiation) contains about 56% NAs. This is very problematic because for model calculation we will drop all rows which contains NAs. We would lose about 56% of our data. Further, the variable TIMESTAMP contains the date. But we are not sure if we can standardize time with box-cox. That is why we do not use these variables. All other variables have good quality.
Attention: unlike here, the variables PA_F and P_F are not used in the AGDS script. Therefore, the results will be significantly different, for example RMSE, RSQ or optimal k.
# Load the function in the file
source("../../R/re_ml_01/function.use.good.quality.only.R")
# Function call to clean the data
daily_fluxes <- use.good.quality.only(daily_fluxes.raw)
# Load the function
source("../../R/general/function.visualize.na.values.R")
# Function call
visualize.na.values.without.groups(daily_fluxes)
Here we split our data into a training subset and a test subset (70% training and 30% test). After we split the data, we can calculate our model. We use all variables except TIMESTAMP and LW_IN_F. For KNN, we use k = 8 as a first try. This choice is random. The last code chunk in this workflow will show you which k is the optimal one (we do not want to spoil anything here). That is why you should use k = 8 first.
# For reproducibility (pseudo-random)
set.seed(123)
# Split 70 % to 30 %
split <- rsample::initial_split(daily_fluxes, prop = 0.7, strata = "VPD_F")
# Split the data in a training set
daily_fluxes_train <- rsample::training(split)
# And a test set
daily_fluxes_test <- rsample::testing(split)
# Load the function in the file.
source("../../R/re_ml_01/function.train.and.test.R")
# Function call for KNN with k = 8
mod.knn <- knn.model(daily_fluxes_train, 8)
# Function call for lm
mod.lm <- lm.model(daily_fluxes_train)
After we split our data, we have to make a model evaluation. The first function call is for the LM model. The function needs the model, which we have calculated in the code chunk above. This is a multivariate regression model, and therefore, we cannot improve the formula.
The second function call is for the KNN model. The knn-model has been calculated in the code chunk above. The number for k was 8. If we want a model with a different k, we have to recalculate the knn-model with the function in the code chunk above.
lm-Model
The orange line has the slope of 45°. If the model predicted all values accurately, all values would lie on this line because the fitted value would be the same as the target. The RMSE would be zero. The bias is in the train data set per definition zero.
The red line is the calculated regression line. It shows that it has a lower angle than the orange line. It follows that our model systematically underestimates the values. We also see that in the negative bias. It also shows that especially high values are more underestimated than smaller ones. If the red line also have a 45° slope in the test data set, then the bias would be zero an we would have a perfect prediction (which is in general impossible).
KNN-Model
The situation for KNN is very similar. The regression line is a little steeper than that of the lm-model. This means that KNN underestimates the values less than the lm-model. We also see that in the lower bias tan for the lm-model. But also knn tends to underestimate high values more than small values.
# Load the function
source("../../R/re_ml_01/function.evaluation.model.R")
# Function call for linear regression model
eval_model(mod = mod.lm, df_train = daily_fluxes_train, df_test = daily_fluxes_test, c("Davos (lm)"), c("Davos (lm)"))
# Function call for KNN
eval_model(mod = mod.knn, df_train = daily_fluxes_train, df_test = daily_fluxes_test, c("Davos (knn: k=8))"), c("Davos (knn: k=8)"))
Here you can see how the KNN-model depends on k. If we use k = 1, then we have perfect training data (the regression line has a 45° slope and all values lie on it. RMSE and bias are zero). But the RMSE in the test data is high (the model is overfitted). If we use k = n our model would be underfitted, and therefore the RMSE in the test data would also be high. For a visualization of the development of RSQ and RSME as a function of k, please take a look at the last visualization. For almost all k the bias is negative which means our models underestimate the target.
# Load the function into the file
source("../../R/re_ml_01/function.different.k.R")
# Define a sequence for k
my.sequence <- c(1, 8, 15, 50, 100)
# Function call
different.k(my.sequence,daily_fluxes_train,daily_fluxes_test,c("Davos"),c("Davos"))
## k-Nearest Neighbors
##
## 1910 samples
## 8 predictor
##
## Recipe steps: BoxCox, center, scale
## Resampling: None
In this short discussion part, we want to show you two main aspects of this exercise and discuss each aspect shortly. First, we need to know how to judge the models. For this, we need to know why one model can be considered better than another. How can we estimate the bias-variance trade-off? Second, we need to know how the KNN-algorithm works. For this, we need knowledge about the role of the hyper-parameter k.
Bias1:
Erroneous assumptions in the learning algorithm lead to a bias. High bias can cause an algorithm to miss the relevant relations between features and target outputs.
Variance1:
The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs.
bias-variance trade-off1:
Both bias and variance are connected to the model's complexity. Low complexity means high bias and low variance. Increased complexity means low bias and high variance. Ideally, we want a model with a low bias and a low variance. But this is often impossible and therefore we need the best trade-off. We describe our model with the terms under- and overfitting. Normally, underfitting implies high bias and low variance, and overfitting implies low bias but high variance.
lm-model:
A lm-model uses all predictors to calculate a global multivariate regression model (see re_stepwise). However, it is a linear regression. If we use the equation (see theory), then we made some assumption (linearity, normality, homoscedasticity and independence).
Obviously, a lm-model does not looks for any underlying pattern. It creates simply a linear combination with all available variables. Further, we can not optimize lm-models like KNN-models. The lm- model describes the ttarin data with all the metrics (bias, variance, or RMSE etc.). It is therefore logical that all these metrics are projected into the new data. These metrics are then relatively similar in the test data as in the train data.
KNN-model (k = 8):
The situation for KNN is different. KNN uses a “distance”. KNN does not necessarily use all predictors. It uses kneighbors to calculate a local regression. The KNN-model looks for an underlying pattern. If we optimize the KNN-model due to the hyperparameter k, then we change the metrics in the train data. We can overfit our model and get a low bias but a high variance, or we can underfit our model and get a higher bias but a lower variance.
So we can play around with these metrics, but you see that there is a trade-off between the bias and the variance. We can neither predict well with an overfitted model nor with an underfitted model. Due to this trade-off, the metrics change in the test sets for the KNN-models more than for lm-models.
We see that KNN-model has higher RSQ than lm (0.66 vs. 0.61) and lower RSME (1.51 vs. 1.63) which means KNN is the better model than lm. But be careful: RMSE and RSQ in the KNN-model depends directly on k. We must optimize k first!
The KNN-model (k = 8) has a bias of -0.06 and lm-model has bias of -0.1 which means the lm-model has a lower model complexity than KNN-model. It follows that lm-model is more on the bias side than KNN-model. We also see that in RSQ. Lm-model has a
But now we know, that optimal k is 15. Therefore, KNN with k = 8 has not the best trade-off and it is also on the bias side (but lesser than lm-model). The other KNN-models (k =50 and k = 100) are on the variance side in the trade-off.
lm-model:
First, there is a annual cycle. We also see that the prediction varies in quality. There are years with a lower residue than other years. The model underestimate the target for a certain season and overestimate for the opposite season. We want to know more. We plot the residue as a function of the day of the year. Now we see, that our lm-model underestimate the target in winter and overestimate it in summer. Because the target is GPP (Gross Primary Productivity) it makes totally sense.
knn-model:
We make the same thing for the KNN-model. The results is very similar to the lm-model. The model underestimate also the winter target and overestimate the values during summer.
# Load the function into the file
source("../../R/re_ml_02/function.vis.residue.R")
# Function call for KNN with k = 8
mod.knn.optimal <- knn.model(daily_fluxes_train, 15)
# Residue for Davos with lm (year)
time.variation.year(mod.lm, daily_fluxes_test,
c("multiple linear regression (lm)"),
c("re_ml_1 (Chapter 9)"))
# Residue for Davos with lm (day of the year)
time.variation.year(mod.lm, daily_fluxes_test,
c("multiple linear regression (lm)"),
c("re_ml_1 (Chapter 9)"), annual = FALSE)
# Residue for Davos with KNNn (year)
time.variation.year(mod.knn.optimal, daily_fluxes_test,
c("KNN with: Train = Davos, Test = Davos, k = 15"),
c("re_ml_1 (Chapter 9)"))
# We want a better resolution of the residue of Davos (daily). We use plot = false to return a tibble. After that, we can pipe it
time.variation.year(mod.knn.optimal, daily_fluxes_test,
c("KNN with: Train = Davos, Test = Davos, k = 15"),
c("re_ml_1 (Chapter 9)"), annual = FALSE)
The k in KNN is a hyper-parameter that refers to the number of nearest neighbors. To quantify the “errors”, we use MAE instead of RMSE. because it is more related to the variance. Although the values are different, their flows are synchronous. That means that both MAE and RMSE are hyperbolas and have their global minima at the same k.
Hypothesis:
“The lower we chose k, the higher the RSQ in the training data, but the higher the variance in the test data.”
Explanation: Let k = 1. Then the training data would be
perfectly predicted because only one
(the nearest) neighbor would be taken into account. We would hit every
target value. The bias would be zero, and therefore RSQ = 1, and the MAE
is 0 as well. But the model would be totally overfitted. That means
there is a bias-variance trade-off. If the bias in the training data is
low, then the variance in the test data is, in general, high.
Let k = N. Then it would follow: \(RSQ\in(0,1]\) and \(MAE\in[0, \infty)\). But we do not know where it is. However, the model tends to be underfitted because it takes all neighbors into account. Therefore, the prediction would be the mean value of the observations.
Conclusion: We want a model with zero bias and zero variance. But this seems impossible, so we have to find the model with the best bias-variance-trade-off.
The hypothesis is not fully true. RSQ is a quantity to describe the variance. Or better, it describes how much of the variance explains the model. For the best bias-variance trade-off, we want k where the error in the test data is lowest (low MAE).
The model for k = 1 is totally overfitted. In the training data, RSQ is 1 and it follows that MAE (or RSME) must be 0. But in the test data, MAE is at a global maximum. The model is useless for any predictions, and the trade-off shifts to too much variance.
If we slowly increase k, then increase MAE and decrease RSQ on the train set. In the test data RSQ increase and MAE decrease. If we exceed optimal k, then RSQ decrease and MAE increase in the test set. We see, there mus be a optimal k.
For k = 100 the RSQ is lowest in the training data. The MAE increased for both data sets but is not at a maximum. The model is not suitable for predictions because the trade-off is on the side of too much bias.
Long story short: If k is to the left of optimal k, then the model is overfitted. If k is to the right of optimal k, then it is underfitted.
The optimal k is there, where the trade-off is best. Generalisability means that we want the model with the lowest MAE in the test data (and so the lowest bias) and the highest RSQ (low variance). For that, we mark the minimum of the RMSE test data hyperbola (green point = 15). Unfortunately, k = 15 is not necessarily the best model. With our function we have only seen that it is between 10 and 20. If we make another function call and use all k between 10 and 20, then we would see that optimal k is 15. We choose this hyper-parameter to calculate our model.
Important: Here we set a seed and split our data only pseudo-randomly. If we do not use set.seed(123), then it would be another optimal k.
# Load the function into the file
source("../../R/re_ml_01/function.parameter.extracter.R")
# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5))
# Visualize the MAE and RSQ --> optimal k is 15
parameter.extracter(my.sequence, daily_fluxes_train, daily_fluxes_test,
c("re_ml_01 (Chapter 9"))